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A common metaphor for describing development is a rugged epigenetic landscape where cell 
fates are represented as attracting valleys resulting from a complex regulatory network. Here, we 
introduce a framework for explicitly constructing epigenetic landscapes that combines genomic data 
with techniques from physics. Each cell fate is a dynamic attractor, yet cells can change fate in 
response to external signals. Our model suggests that partially reprogrammed cells are a natural 
consequence of high-dimensional landscapes and predicts that partially reprogrammed cells should 
be hybrids that co-express genes from multiple cell fates. We verify this prediction by reanalyzing 
existing data sets. Our model reproduces known reprogramming protocols and identifies candidate 
transcription factors for reprogramming to novel cell fates, suggesting epigenetic landscapes are a 
powerful paradigm for understanding cellular identity. 



Understanding the molecular basis of cellular identity 
and differentiation is a major goal of modern biology. 
This is especially true in light of the work of Takahashi 
and Yamanaka demonstrating that the overexpression of 
just four transcription factors (TFs) is sufficient to con- 
vert somatic fibroblasts into cells resembling embryonic 
stem cells (ESCs), dubbed induced pluripotcnt stem cells 
(iPSCs) [1]. The idea of using a small set of TFs to re- 
program cell fate has proven to be extremely versatile 
and reprogramming protocols now exist for generating 
neurons [2] , cardiomyocytes [3] , liver cells [4, 5] , and hu- 
man iPSCs [6] . Despite these revolutionary experimental 
advances, cell fate is still poorly understood mechanisti- 
cally and theoretically. Recent experiments suggest cell 
fates can be viewed as high-dimensional attractor states 
of the gene regulatory networks underlying cellular iden- 
tity [7]. In particular, cell fates are characterized by 
a robust gene expression and epigenetic state resulting 
from the complex interplay of transcriptional regulation, 
chromatin regulators, non-coding and micro RNAs, and 
signal transduction pathways. 

These experiments have renewed interest in the idea of 
an 'epigenetic landscape' that underlies cellular identity 
[8-11]. In the landscape metaphor, cell fates are viewed 
as attracting basins in a rugged landscape and differen- 
tiation proceeds through signal-dependent 'low-energy' 



valleys that connect cell fates (see Figure 1). Cells can 
also change fates due to probabilistic barrier crossings 
between valleys as in cellular rcprogramming[12, 13]. 

Traditionally, epigenetic landscapes have been mod- 
eled as low-dimensional systems using either a small sub- 
set of genes [14] or a few cell fates [15]. These methods 
cannot easily be scaled to higher dimensions. Instead, 
inspired in part by the success of statistical energy land- 
scapes in protein folding [16], we avoid these limitations 
by combining large-scale genomic data with techniques 
from spin glass physics and neural networks [17-20]. Us- 
ing only data of mouse microarray gene expression states 
as input, we construct a parameter-free epigenetic land- 
scape model for cellular identity with 95 stable cell fates 
and 1152 TFs. Each cell fate is a robust attractor, yet 
cells can deterministically switch fates in response to ex- 
ternal signals. Our model provides a unified framework 
to discuss differentiation and reprogramming. It also nat- 
urally explains the existence of partially reprogrammed 
cell fates (stable cell fates found in reprogramming ex- 
periments but not in vivo) as 'spurious' attractors re- 
sulting from the high dimensionality of the landscape. 
Our model predicts, and we verify, that partially repro- 
grammed cells are hybrids that co-express TFs of mul- 
tiple naturally occurring cell fates. Finally, our model 
reproduces known reprogramming protocols to iPSCs, 
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heart, and liver, and can be used for designing repro- 
gramming protocols to novel cell fates. Taken together, 
these results suggest that epigenetic landscapes represent 
a powerful framework for understanding the molecular 
circuitry and dynamics that gives rise to cell fate. 

Any landscape model must reproduce several exper- 
imental observations. Most importantly, all cell fates 
must be robust attractors, yet allow cells to change fate 
through rare stochastic transitions as in cellular repro- 
gramming experiments [21, 22]. A common result of re- 
programming is not the desired cell fate, but partially re- 
programmed cells [23, 24]. These results suggest that the 
landscape is rugged and may contain additional spurious 
attractors corresponding to cell fates that do not natu- 
rally occur in vivo. In addition, environmental and ex- 
ternal signals can control cell fates. Some environments 
stabilize particular cell fates (Fig. IB). A dramatic ex- 
ample of this is a protocol for rcprogramming to neural 
progenitor cells (NPCs) that is identical to Yamanaka's 
protocol for reprogramming to ESC except for the cultur- 
ing media [25]. Other external signals deterministically 
switch cell fates, as occurs in normal development (Fig. 
1C) [26]. Together, these imply the landscape is a dy- 
namic entity that depends on environmental signals, and 
we will show that our landscape successfully incorporates 
all of these experimental observations. 

CONSTRUCTING EPIGENETIC LANDSCAPES 

The epigenetic state space 

Cellular identity and differentiation are largely con- 
trolled by epigenetics, especially histone modifications 
(HMs)[27] (Fig. 2A). Consequently, the ideal data set 
for constructing epigenetic landscapes are the genome- 
wide HM states for genes in all cell fates. However, un- 
like microarrays, global HM data are limited to a few 
cell fates [28, 29]. To circumnavigate this problem, wc 
used available HM data and compared them to microar- 
ray gene expression levels for available cell fates; this 
created a conditional probability distribution of having 
a HM given a TF expression level (Fig. 2B). We found 
a sharp threshold which distinguished genes with the ac- 
tivating modification of histone 3 tri-methylation at ly- 
sine 4 (K4) from genes with the inactivating modifica- 
tion of histone 3 tri-methylation at lysine 27 (K27) and 
poised/bivalent genes (both K4 and K27). This thresh- 
old implies that continuous TF expression levels can be 
used to infer the discrete HM states. We then used 393 
relevant whole genome microarrays (details in SI) to cre- 
ate a binary TF state for N = 1152 TFs (labeled by Latin 
indices i, j) in p = 95 cell fates (labeled by Greek indices 
/j,, v). A TF was designated active, (+1), if its expres- 
sion level in the corresponding microarray was above the 
threshold and inactive, (—1), otherwise (Fig. 2C). These 



binary (i.e. on/off) TF data are the only biological in- 
put into our model. We restricted our considerations to 
TFs due to their importance in cellular reprogramming 
and differentiation. However, our model can be easily 
generalized to include other important genes. 

Motivation from attractor neural networks 

The Takahashi and Yamanaka reprogramming experi- 
ments [1] are reminiscent of content-addressable memory 
and attractor neural networks. A content-addressable 
memory allows one to retrieve a full memory based on 
sufficient partial information. To paraphrase the origi- 
nal Hopfield paper [17], if we want to recall a paper cita- 
tion, for example, "John J. Hopfield, Neural networks and 
physical systems with emergent collective computational 
abilities, 1982," a content-addressable memory allows us 
to recall the full citation with the partial recall "Hop- 
field Neural networks 1982" or other sufficient details. In 
the Yamanaka reprogramming protocol, overexpressing 
only four TFs is enough for a fibroblast to "recall" the 
global TF expression of an ESC. A content-addressable 
memory is naturally represented as a basin of attraction 
in a dynamical system, with partial recall corresponding 
to entering the basin of attraction and full recall corre- 
sponding to reaching the minimum of the basin. Hop- 
field attractor neural networks [17, 18, 20] are a general 
method to take an input set of vectors ( "memories" ) and 
explicitly construct a unique, global, landscape such that 
each input vector is a global minimum and has a basin 
of attraction. 

Mathematical model 

We now give a brief mathematical description of how 
we construct epigenetic landscapes. An arbitrary state 
of a cell is represented as a vector Si of length N — 
1152, with the ith component of the vector +1 if TF i 
is active and —1 if it is inactive. A cell fate /i = 1 ... 95 
is characterized by its binary TF expression vector, 
Our data set determine the £f and these are the only 
biological input into the landscape. 

A direct application of the original Hopfield neural net- 
work [17] fails for the real biological data set we con- 
structed. The original Hopfield method works for vec- 
tors whose components are random, independent vari- 
ables with equal probability of on (+1) and off (— 1) 
states. This leads to Gaussian noise between mem- 
ories [18]. However, cell fates are highly correlated 
(Fig. SI), as shown by the cell type correlation matrix 
j^iv _ l/NJ2i=i€i£i which characterizes the correla- 
tion between cell fate \x and v. These correlations lead 
to noise that globally destabilizes all basins of attrac- 
tion. The Hopfield construction can be generalized to 
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the projection-method which can incorporate correlated 
"memories" [19]. 

Mathematically, the high correlation between cell fates 
implies that the overlap (vector dot-product), — 

V-WS»=i £f &i, * s a P 00r measure of how close a cell 
with expression state Si is to a cell fate /x. Instead, the 
"projection" accurately characterizes correlated vectors 
and is given by a M = 'Y?v=i{A~ 1 ) llv m" , of Si on a cell 
fate fi. The projection measures the orthogonal projec- 
tion of a state Si onto the subspace spanned by naturally 
occurring cell fates, {£f } (see Fig. 2D and SI). 

We now explicitly construct the landscape using the 
potential energy (Lyapunov function), H, given by 

H = -\ E S ^ S i - y E * V 2 E SiK^Sj 

(1) 

The coupling constants = 

V^E^iELiC^ 1 )^ represent effective, 
correlation-based, interactions between TFs that may 
be indirect and give rise to stable basins of attraction 
around each cell fate. The x 1 * are an environment- 
dependent local field that can stabilize or destabilize 
a specific cell fate fj,. The -FQj(c) are differentiation 
couplings that switch between cell types and can depend 
on the environment through the external input c. Future 
work will explore the exact dependence of switching on 
the external environment. See Box 1 for technical de- 
scription, and see SI for more details of the Hamiltonian 
and a geometric interpretation. 

The dynamics of the network proceed by random, 
asynchronous updates [20] according to the probability 

p Phi(t)Si(t) 

P ft(' + 1 >] = e/»M0 +e -/»Mt) < 2 > 

with the local field on TF i given by hi = — §§:, and 
P an effective noise parameter that controls the level of 
stochasticity resulting from biochemical noise (see SI). 
This update time cannot be directly related to biologi- 
cal time. Instead, inspired by experiments showing that 
reprogramming rates scale with cell division rates [12], 
and the observation that cellular divisions produce HM 
errors [30] , we introduced an additional source of stochas- 
ticity into the dynamics by periodically flipping a fixed 
percentage of TF states to mimic cell division (see SI). 

RESULTS 

Cell fates are dynamic attractors that are responsive 
to signals 

We tested our model using two in silico experiments. 
To verify that naturally occurring cell fates are dy- 
namic attractors, we randomly perturbed the TF state 



of cells, Si, from the ESC state and then tracked the 
TF state over time. Fig. 2E shows the projection of 
the TF state on the ESC state as a function of time. 
For a large number of starting conditions, after an ini- 
tial transient, the system relaxes back to the ESC state 
(red bracket), explicitly demonstrating the existence of 
a large basin of attraction [7]. Originally, the interac- 
tions, J^, between TFs are symmetric, while biologi- 
cally realistic interactions should be non-symmetric, im- 
plying a non-Lyapunov pseudo- potential [31]. The sym- 
metry condition can be relaxed, for example by ran- 
domly deleting 20% of interactions (Fig. 2E Diluted). 
The cell fates remain robust attractors even for these 
pseudo-potentials. Our model can also deterministically 
switch between cell fates in response to differentiation 
signals. For example, the common myeloid progenitor 
(CMP) is a blood cell fate that in vivo can differenti- 
ate into either granulo-monocytic progenitors (CMP) or 
megakaryocyte-erythroid progenitors (MEP). In Fig. 2F, 
we show in silico experiments where we start the system 
in the CMP state and show the trajectories after apply- 
ing either the CMP (signal 1, blue) or MEP (signal 2, 
red) differentiation signal, resulting in branching to two 
distinct cell fates. 

Partially reprogrammed cells as "spuirous" 
attractors 

Partially reprogrammed cells have a natural interpreta- 
tion in our model as spurious attractors arising from the 
high dimensionality (N = 1152) of the state space [20]. 
These spurious attractors are guaranteed by topology of 
high-dimensional vector spaces and can be interpreted as 
potential cell fates that do not occur in vivo. Naively, 
one would expect partially reprogrammed cells to have 
minimal projection on natural cell fates since in high- 
dimensional vector spaces, any two random vectors are 
orthogonal (Fig. S2). The literature on neural networks 
[19] specifically predicts that in our model, spurious at- 
tractors should be hybrid cells that co-express genes from 
multiple cell fates, not necessarily including the begin- 
ning or ending cell fate. Reanalyzing all existing genome- 
wide data sets on partially reprogrammed cells (Table 
1) validates our prediction. The purity of the partially 
reprogrammed cell colonies is important because a het- 
erogenous sample could mimic hybrids. However, nearly 
all studied partially reprogrammed cells grew as homoge- 
nous colonies. Therefore, the limited experimental data 
does not support the idea of partially reprogrammed cells 
being explained by cell culture heterogeneity. In addi- 
tion, reexamination of a claimed iPSC-to-NSC conversion 
[32] shows that the resulting cell fate is more accurately 
characterized as a partially reprogrammed cell. This il- 
lustrates how the techniques developed here can be used 
to improve the classification of reprogrammed cells. 
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TABLE I. Partially reprogrammed cells as spurious at- 
tractors. Partially reprogrammed cell lines (first column) 
and their significant projections (i.e 2 std above noise) onto 
"natural" cell fates based on microarray data. The last col- 
umn indicates whether partially reprogrammed cell lines were 
homogeneous. The iPSC-NSC was originally classified as an 
NSCs [32] but is more accurately characterized as a partially 
reprogrammed hybrid of NPCs and NSCs. Abbreviations: 
iPSC, induced pluripotent stem cell; NSC, neural stem cell; 
NPC, neural progenitor cell; ESC, embryonic stem cell; MEF, 
mouse embryonic fibroblast; MPP, multi-potent progenitor. 



Cell line 


Start 


Goal 


Highest projecting states (projection) 


Homogeneous? 


1A2 [23] 


MEF 


ESC 


MEF (0.187), ESC (0.163), muscle-smooth (0.126) 


Yes 


1B3 [23] 


MEF 


ESC 


MEF (0.161), ESC (0.149) 


Yes 


BIV1+ [24] 


B Cell 


ESC 


melanocyte (0.1548), MPP (0.136) 


No info 


BIV1- [24] 


B Cell 


ESC 


ESC (0.434), pro-mesoderm (0.162), 
MPP (0.153), neural crest SC (0.127) 


No info 


MCV6 [24] 


MEF 


ESC 


ESC (0.264), MPP (0.145), 
pre-erythroid (0.133), smooth muscle (0.132) 


Yes 


MCV8 [24] 


MEF 


ESC 


ESC (0.180), MPP (0.160), melanocyte (0.143) 
large intestine (0.135) 


No 


iPSC-NSC [32] 


iPSC 


NSC 


NPC (0.428), NSC (0.281) 


Yes 



Identifying transcription factors for cellular 
reprogramming 

Our landscape model also provides a quantitative 
method to identify candidate TFs for reprogramming. 
First, recent experiments provide evidence that repro- 
gramming TFs should be based only on final, not initial, 
cell fates [22]. Second, intuitively, reprogramming candi- 
dates should be both highly expressed and highly "pre- 
dictive" of the desired cell fate. Since TF expression lev- 
els are well- fit by a log-normal distribution (Fig. S3), the 
log-normal z-score naturally defines high and low TF ex- 
pression levels. Within our landscape, the "predictivity" 
of a TF, for a given cell fate, is measured by its con- 
tribution to the potential energy of that cell fate. This 
"projection-contribution" is mathematically represented 

as r/f = l/NYZ^A- 1 )^^ for TF i in cel1 fate M- To 
obtain a single quantitative rank, we can multiply the 
z-score by the projection-contribution. 

We validate our candidate reprogramming TFs by 
comparing to existing protocols. Fig. 3 shows every TF's 
projection-contribution versus z-score in ESC, heart (car- 
diomyocytes) , and liver (hepatocytes) . Note that neu- 
rons were not included due to lack of compatible mi- 
croarray data. ESC are examined in Fig. 3A and Fig. 
3B. We have explicitly labeled TFs from existing repro- 
gramming protocols [21, 22]. As expected, these TFs are 
both predictive and highly expressed (upper right hand 
corner of graphs) . To check the biological validity of our 
predictions (SI for details), we analyzed the GO Anno- 



tation of our top 100 candidates for ESC reprogramming 
(Table SI). Within these top TFs, 9 have successfully 
been used in reprogramming, 8 are known pluripotcncy 
TFs (involved in maintaining stem cell fate), while 32 
have no known function as of yet and are intriguing re- 
programming candidates. 



ESCs are unusual in that the highly expressed TFs 
are also among the most predictive (e.g. Pou5fl/Oct4, 
Nanog). The true benefit of using both z-score and 
projection-contribution is best demonstrated by exam- 
ining reprogramming protocols to heart (Fig. 3C) and 
liver (Fig. 3D) (Table S2). For example, in heart (car- 
diomyocytes) reprogramming [3], Gata4 is ranked 68th in 
z-score TF expression, but our combined ranking of z- 
score and projection-contribution ranks Gata4 as 6th. 
Similar results are seen in liver (hepatocyte) protocols 
[4, 5] (see SI). 



In most cell fates, the highly expressed TFs arc not 
necessarily predictive, suggesting landscapes may be use- 
ful for rationally-designing reprogramming protocols to 
novel cell fates. Using our landscape model, we have 
identified the top 100 candidates for overexpression and 
as well as the top 100 candidates for knockouts for all 
cell fates where we have reliable data (see SI). Besides 
being candidates for reprogramming, the predictive TFs 
can be used as markers for each cell fate. 
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DISCUSSION 

Our work suggests that epigenetic landscapes are a 
powerful paradigm for understanding cellular identity 
and reprogramming. Remarkably, despite having no free 
parameters or biological knowledge beyond binary TF 
states of cell fates, our landscape models can explain par- 
tially reprogrammed cell fates and identify TFs used in 
cellular reprogramming protocols. 

The epigenetic landscapes of cellular identity con- 
structed in this paper are related to but distinct from 
the standard Waddington landscape for development [8] . 
In both landscapes, cell fates are viewed as valleys of 
a potential. However, Waddington's landscape is con- 
cerned with how developmental signals give rise to differ- 
ent cell fates. Hence, the axis of the original Wadding- 
ton landscape can be interpreted combination of 
time, signals, as well the state of molecular components 
such as transcription factors. In more modern language, 
Waddington's landscape is concerned with the geometry 
of bifurcations in space of possible developmental sig- 
nals (see [33] for an example). In contrast, in the land- 
scapes considered in this paper the state space is com- 
posed solely of the epigenetic state of transcription fac- 
tors and developmental signals, c, couple to the landscape 
through the signal-dependent parameters Kij(c). 

Currently, our model has several limitations centered 
around dynamics. First, for ESC reprogramming, our 
model docs not highlight the importance of the non- 
specific transcription factor Myc (see SI for a detailed 
discussion). Second, as mentioned previously, our use of 
asynchronous dynamics cannot be directly related to bi- 
ological time. Lastly, our landscape does not accurately 
reflect the dynamics of reprogramming. Simulations of 
reprogramming with known protocols, such as the Ya- 
manaka protocol, lead to rates of reprogramming that 
are comparable to the rates from a reprogramming sim- 
ulation with a randomly selected protocol. This is likely 
due to the fact that cell fates are extremely stable and 
hence reprogramming is extremely rare and hence hard 
to stochastically sample. 

Our model has several possible extensions. Following 
previous work on neural network [34], our landscape can 
be generalized to continuous TF expression levels. Addi- 
tionally, our landscapes can provide a multitude of pre- 
dictions given the correct data. Our framework can easily 
be generalized to include microRNAs, other genes, or the 
human epigenetic landscape given appropriate data sets. 
This opens up possibilities of improving upon the high re- 
programming rates achieved by overexpressing microR- 
NAs [35] or synthetic mRNAs [36]. Another attractive 
clement of the framework presented here is that it allows 
for a quantitative analysis of whole genome- wide expres- 
sion states (see Table 1). This is likely to yield a more 
accurate classification of reprogrammed cells and allow 



for the classification as well as the identification of dis- 
eased cell fates. Finally, our epigenetic landscape may 
also prove useful for designing more efficient directed dif- 
ferentiation protocols [37] . Overall epigenetic landscapes 
provide a unifying framework for cell identity, reprogram- 
ming, and directed differentiation. 

MATERIALS AND METHODS 

All microarray data utilize the Affymetrix GcncChip 
Mouse Genome 430 2.0 platform and were downloaded 
using Array Express (www.ebi.ac.uk/arraycxprcss). The 
details of all 393 microarrays can be found under ac- 
cession GSE (to be determined). Microarray probe-to- 
gene map was created with Bioconductor 2.10. All raw 
microarray files were processed in one batch by robust 
mean averaging (RMA) in MATLAB. Since we were in- 
terested in cellular identity, only transcription factors, 
transcription factor co-factors, or chromatin remodeling 
genes were kept (for short hand, referred to as transcrip- 
tion factors (TF) throughout the text) [38]. 

Ideally, we would like the global histone modification 
(HM) state of all cell fates since these arc the primary in- 
put into our model. However, global HM data are limited 
[28, 29]. Consequently, we used the global HM data for 
these three cell fates and compared them to microarray 
TF expression levels. This allowed us to create a con- 
ditional probability distribution of each HM for a given 
TF expression level (Fig. 2B). We found a sharp cut- 
off (ss 5.5) which distinguished TFs with the activating 
modification of histone 3 tri-methylation at lysine 4 (K4) 
from TFs with the inactivating modification of histone 3 
tri-methylation at lysine 27 (K27), poised/bivalent TFs 
(both K4 and K27), and no HM (most likely DNA methy- 
lation) . 

This conditional probability distribution allowed us to 
create a binary expression state for each cell fate from 
our microarray data. TFs with expression above 5.5 were 
designated on, (+1), while TFs below 5.5 were designated 
off, ( — 1). The conclusions presented in the paper are 
robust to the threshold choice (not shown, but similar 
conclusions reached for a cutoff of 5 or 6). See SI for full 
details. 
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Box 1. Overview of our quantitative cellular identity landscape 

N transcription factors (TF) labeled by Latin indices i and p cell types labeled by Greek indices fx. Each TF is 
either on (+1) or off (—1). 

A general network state is represented by a vector Si of length N. A cell type jj, is represented by the vector 

We require all £f to be attractors in the landscape. This is ensured by constructing a correlation-based interaction 
network 

J «4tt^"'re ( 3 ) 

fj, = l v = l 

with the correlation matrix between cell types: 

1 N 

i=i 

The commonly used overlap order parameter is the "magnetization" 



™»"4e^- ( 5 ) 



i=l 



However, cell types are highly correlated with each other. Thus, a more accurate order parameter is the "projection" , 
a M , on each cell type 



v 



a- = ^(A-'rm". (6) 

• The specific "projection-contribution" of a given TF i to the projection on cell type /x is given by 

< = ^E(^ _1 rer (7) 
i/=i 

• Cell types may be stabilized by external conditions (such as growing in a favorable culture media). For example, 
cell type fi is favored by x MflM > where \ characterizes the degree of the stabilization. 

• External signals, c, can induce switching between cell types and is represented by the matrix Kij(c). See SI for more 
detail. 

• The landscape is defined by function H. Term 1 is an energy, or Lyapunov function, that generates the basins of 
attraction. Term 2 stabilizes specific cell types (ie culture term). Term 3 is non-Lyapunov and represents switching 
due to external signaling. 

H = -\H s i J » s > - y E * v - \ E E SiK^Sj. (8) 

ij (J ij )iv 

• TFs can stochastically switch states. A TF is biased towards a state by its interactions with the network through 
its local field hi = The dynamics are stochastic and controlled by a global noise parameter /3. At each time 
step, one TF is updated with the probability of state Si(t + 1) at time t + 1 related to the state Si(t) and local field 
hi(t) at time t by 

pig- (t + 1)1 - Exp[ph t (t)S^)] 



A, Minimal Landscape 



B, Cell Stabilized by Environment C. Cell Switch due to External S 





FIG. 1. Phenotypic Landscape. These landscapes envi- 
sion cell fates as attractors in a signal-dependent landscape. 
(A) The minimal cellular identity landscape. Each cell fate is 
a basin of attraction (black circles) . Reprogramming between 
different cell fates (1 and 2) can occur probabilistically via 
different trajectories (black paths). Partially reprogrammed 
cells (PRC) exist as smaller, spurious, basins of attraction 
(red circle) that can be experimentally observed by repro- 
gramming experiments (example trajectory in red). (B) Same 
cellular identity landscape in the presence of a stabilizing en- 
vironment (ex. favorable culturing medium) for cell fate 2. 
The environment increases the radius and depth of the cell 
fate 2 basin of attraction. (C) Landscape in the presence of 
an external signal that gives rise to differentiation from cell 
fate 1 to cell fate 2 (ex. growth factors associated with differ- 
entiation). Notice the low energy path between the cell fates 
that drives switching from cell fate I to cell fate 2. 




FIG. 2. Overview of model. (A) Histone 3 tri-methylation 
at lysine 4 (K4) is associated with active genes, while his- 
tone 3 tri-methylation at lysine 27 (K27) is associated with 
repressed genes. (B) Conditional probability distribution of 
histone modification (HM) given transcription factor (TF) ex- 
pression levels derived by comparing microarray data with 
HM data from [28, 29]. Notice the sharp threshold (black 
line) between expression levels of active and inactive TFs. 
(C) Using (B), continuous TF expression levels is converted 
into binary states. (D) An arbitrary state is represented by 
a vector S of ±1, with each dimension in the vector space 
representing the state of a TF. The natural cell fates form 
a subspace (gray plane). The landscape model is based on 
the orthogonal projection of the TF state onto this subspace. 
(E) The dynamics of the landscape model for different ini- 
tial conditions for a fully connected interaction matrix Jij 
and a diluted interaction matrix where 20% of interactions 
have been randomly deleted. Plot shows the projection of S 
on embryonic stem cells (ESC) as function of time. Notice 
the large basins of attraction (red bracket). (F) Simulations 
showing how a common myeloid progenitor (CMP) can dif- 
ferentiate into either granulo-monocytic progenitors (CMP) 
or megakaryocyte-erythroid progenitors (MEP) in response 
to two distinct external signals. 
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FIG. 3. Identifying reprogramming candidates. For a 

given cell fate, we plot every transcription factor's (TF) pre- 
dictivity (aka energy projection-contribution, 77^) vs TF ex- 
pression level. Gray TFs are within the the uncertainty intro- 
duced by making data discrete (binary). All reprogramming 
TFs are in a pre-existing protocol. Boxed TFs are potential 
reprogramming candidates since they have both high expres- 
sion level and high projection-contribution. (A) Embryonic 
stem cell graph (ESC). (B) Zoom in of box in (A). TFs used 
in known reprogramming protocols are labeled [21, 22]. (C) 
Heart. Labeled TFs can reprogram fibroblasts to cardiomy- 
ocytes [3]. (D) Liver. Labeled TFs can reprogram fibroblasts 
to liver cells. One protocol used Hnf4a plus any of Foxal, 
Foxa2, or Foxa3 [4] while another used Gata4, Foxa3, Hnfla, 
and deletion of pl9Arf [5]. Hnfla and pl9Arf were not differ- 
entially expressed in our microarrays and are not shown. 
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I. DATA SET CONSTRUCTION 

All microarray data utilize the Affymetrix GeneChip Mouse Genome 430 2.0 platform and were downloaded using 
ArrayExpress (www.ebi.ac.uk/arrayexpress). The details of all 393 microarrays can be found under accession GSE 
(to be determined). Microarrays are all control or natural conditions. Microarray probe-to-gene map was created 
with Bioconductor 2.10. All raw microarray files were processed in one batch by robust mean averaging (RMA) in 
MATLAB, and genes with multiple microarray probes were averaged. At this point, the data set consisted of 393 
microarrays with 20877 genes. Since we were interested in cellular identity, only transcription factors, transcription 
factor co-factors, or chromatin remodeling genes were kept (for short hand, referred to as transcription factors (TF) 
throughout the text) 1 , leaving 1612 TFs. 

Ideally, we would like the global histone modification (HM) state of all cell fates since these are the primary input 
into our model. However, global HM data are limited to embryonic stem cells (ESC), mouse embryonic fibroblasts 
(MEF), and neural progenitor cells (NPC) 2 ' 3 . Consequently, we used the global HM data for these three cell fates and 
compared them to microarray TF expression levels. This allowed us to create a conditional probability distribution 
of each HM for a given TF expression level (Fig. 2B). We found a sharp cutoff (sa 5.5) which distinguished TFs with 
the activating modification of histone 3 tri-mcthylation at lysine 4 (K4) from TFs with the inactivating modification 
of histone 3 tri-mcthylation at lysine 27 (K27), poised/bivalent TFs (both K4 and K27), and no HM (most likely 
DNA methylation). 

This conditional probability distribution allowed us to create a binary expression state for each cell fate from our 
microarray data. TFs with expression above 5.5 were designated on, (+1), while TFs below 5.5 were designated off, 
(— 1). The conclusions presented in the paper are robust to the threshold choice (not shown, but similar conclusions 
reached for a cutoff of 5 or 6). After the binarization of TF expression, all TFs that were not differentially expressed 
across cell fates (i.e. TFs that are always on / always off in every cell fate) were dropped, leaving 1152 TFs. The 
binarized TF expression for the 95 cell fates was found by first binarizing all 393 microarrays and then taking the 
majority vote for each cell state (with ties broken by averaging the continuous data). The final result was the binary 
expression state for 95 cell fates and 7 partially reprogrammed cell fates with 1152 TFs. 

Several self-consistency checks were performed on the data. First, the correlation matrix A^ v (explained in main 
text and below) was calculated for the original continuous data and for the binarized data (Fig. SI). Both correlation 
matrices are consistent with each other showing binarization does not change the global correlations. Note that in 
the correlation matrix, cell fates have been grouped by tissue type, leading to a block diagonal form. Second, the TF 
expression is well fit by a log-normal distribution (Fig. S3), and the mean of this distribution (5.35) is consistent with 
the threshold (5.5). Third, the expression state of all cell fates was constructed from multiple microarray experiments. 
These different experiments were compared with each other and were within 2 standard deviations for all cell fates 
(see Fig. S2 for definition of std). This demonstrates that microarrays from multiple laboratories can be directly 
compared. 
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A complete list of microarrays used in this study is included as an Excel file. In addition, the processed data 
(continuous and binary) are also included as an Excel file. 

II. OVERVIEW OF MODEL 

A. Epigenetic landscapes from the projection method 

Here we expand upon the explanation of the quantitative model from the main text. The state of a cell is represented 
as a vector St of length TV = 1152, with each dimension of the vector (labeled by Latin indices i, j) corresponding 
to the binary TF expression, either +1 if TF i is active and —1 if it is inactive. There are p — 95 cell fates (labeled 
by Greek indices /i, v) and cell fate /i is a N by 1 vector characterized by its binary expression state, The TF 
expression for each cell fate, is given by our microarray dataset. This is the only input of biological data in our 
landscape. 

In general, cell fates are highly correlated (Fig. SI), and it is useful to dchne a correlation matrix A^ v = 
V-^Sili ii^i which characterizes the correlation between cell fate /i and v. The high correlation between cell 
fates implies that the overlap (vector dot-product), = l/^S2=i £f ^ s a P 00r measure of how close a cell 
with expression state Si is to a cell fate \i. A more accurate characterization is provided by the "projection", 
= Y^v=i(A~ 1 Y'' 'my , of Si on a cell fate \i. The projection measures the orthogonal projection of a state Si onto 
the subspace spanned by naturally occurring cell fates, {£f } (see Fig. 2D). The "projection-contribution" is math- 
ematically represented as rft = 1/N S^iX^ -1 )^^ f° r TF i in cell fate /i. This can be interpreted as the energy 
contribution of TF i to cell fate /i. This is also a measure of the "predictivity" of a TF for a given cell fate. 

Using the state space, we can explicitly construct the landscape, H, as: 

H = -\ £ SiJijSj SiKijtfSj (1) 

i,3 i i-,3 

with the dynamics set by random, asynchronous update according to the probability 

0hi(t)Si(t) 

P[Si(t + l)} = e , hi(t)+e _ 0hi{t) (2) 

with the local field on each TF given by hi — — |^ . 

The Hbasin = Si j SiJijSj produces stable basins of attraction (Fig. 1A). This is done by inferring a 

correlation-based, TF interaction matrix 4 

^L — 1 U— 1 

with effective interactions that may be indirect. 

This stabilizing term of the landscape can be rewritten in terms of the overlap and projection as follows: 

H basin = -^SiJi^ = -±j2J2 s ^( A ^ u ^ s 3 - -f E mAVI w 

A simple geometric picture illustrates that Hbasin — — 1/2 Si j SiJijSj makes each cell type a global minimum 
of the landscape. An arbitrary vector can be rewritten in terms of its projection in the cell fate subspace and its 
orthogonal component 6 Si 4 , 

Si = J2 a ^i+ 6S i ( 5 ) 
Then, the distance of an arbitrary vector S to the cell fate subspace is given by A, 



A = Q>S 4 ) 2 ) 1/2 

i 



(6) 
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which can be rewritten as 

N 



(7) 



This allows us to rewrite the stabilizing term of the landscape as 

Hbasin = - y + 

This provides a very clear interpretation of the landscape as the global distance of an arbitrary vector S to the 
natural cell fate subspace 4 . 

To demonstrate cell fate stability, an ESC state for the fully connected interaction matrix was randomly perturbed 
(Fig. 2E, Fully Connected) and evolved in time. For a variety of initial conditions, it relaxed back to the ESC, 
demonstrating a large basin of attraction (red bracket). While our original construction is a true potential or Lyapunov 
function, this condition can be relaxed. Random dilution (Jy = 0) creates a pseudo-potential or non-Lyapunov 
function, which has a smaller, noisier basin of attraction, see (Fig. 2E, Diluted) where 20% of Jjj were removed. 
Note, we show the trajectories of the diluted interaction matrix to demonstrate the robustness of the model, but 
further experimental data is needed to introduce biologically realistic (i.e. not random) dilution. 

The H cu iture = — 1/2^ JQSi stabilizes specific cell fates (Fig. IB). This can be explicitly shown as: 

HcuUure = ~\ £ M = -y £ X^S, = "y£ X^ (9) 
i i i 

where cell fate fj, is favored x M which characterizes the degree of environmental stabilization 5 . 

The H switc h = Xli j SiKij(c)Sj represents switching due to external signals c (Fig. 1C). The original inter- 

pretation as an interaction in TF space can be written in terms of cell fates as 

H switch = —Y^SiK^Sj = -^EE^(^^ = _^£ m "G"W (10) 

where G^ v characterizes the switching from the switching from cell fate v to cell fate /i and in general is a complicated 
function of external signals c. For now the c is included to show where external signals couple to switching, but more 
work is needed to determine the explicit interactions between the environment and cell fate switching. 

We demonstrate this term with a biological example. The common myeloid progenitor (CMP) is a blood cell fate 
that in vivo can differentiate into either granulo-monocytic progenitors (GMP) or mcgakaryocyte-erythroid progenitors 
(MEP). We demonstrate (Fig. 2F) that signal 1 (red), 

qGMP,CMP _ i j--^ 

(with rest of G^ v — 0, i.e. signal 2 absent) can convert CMP to GMP, while a separate application of signal 2 (blue) 

qMEP,cmp _ ^ ^22) 

(with rest of G^" — 0, i.e. signal 1 absent) can differentiate MEP to GMP. These switching terms are explicitly 
non-Lyapunov because they break symmetry; for example, signal 1 switches CMP to GMP, but not GMP to CMP. 

The standard dynamics is random, asynchronous update 5 . The interaction network biases a TF to its state by 
its local field hi — — §§:■ The dynamics are stochastic and controlled by a global noise parameter /?, which is the 
inverse temperature, j3 = 1/T. At each time step, one TF is probabilistically updated based only on its local field, 
hi, and the global noise, j3. The update time has no physical meaning, while the biologically relevant time scale is 
cell division 6 . Since cellular division produces HM errors 7 , a physical time scale is introduced by periodically flipping 
a fixed percentage of TF states. The system is still stable under this additional noise, as shown in Fig. 2E which 
incorporates both asynchronous dynamics with (3 = 1/0.45 = 2.2 and 2% errors every 5000 time steps. The noise due 
to the temperature (3 is equivalent to partial annealing or smooth fluctuations about the landscape while the bursts of 
noise due to cell division is equivalent to quantum annealing or jumps through phase space which can tunnel through 
barriers in the landscape. 
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B. Model in matrix notation 



For completeness, we again present the model above but now in explicit matrix notation. S is a N by 1 vector and 
£ is a p by N matrix that is based on biological data. The correlation matrix is p by p 



The overlap and projection are both p by 1 vectors 



a = (is) 



m=^S (14) 



a = A- 1 m=±A- 1 ZS=(tt T rHS (15) 



and the projection-contribution is a p by N matrix 



n = = (ft r )-^ (16) 

The TF interaction matrix that creates stable basins of attraction a N by N matrix 

J=±?A- 1 Z = f(t?)- 1 t (17) 

One can verify that this is a projection matrix since J 2 = J. The culture stabilization term in TF space is a N by 1 
vector given by X . In cell fate space it is represented as X — Nr/ T x T ■ 

The TF interaction matrix that induces switches between cell fates is a TV by N matrix 

K = j f eG V =eG(tt T r 1 !; as) 



where the cell fate switching matrix G is a p by p matrix. 
The landscape can be written in TF space as 



H = --S T JS--X T S- 1 S T KS (19) 
2 2 2 v ' 



Or the landscape can be given in terms of cell fates as 



N T N T N ~ , N 

H = - -m T a - - X T a - -m T Ga (20) 



C. Partially reprogrammed cells as spurious attractors 



One of the most generic properties of all attractor neural network constructions is that in addition to the desired 
attractors, £f , the non-linearity of the dynamical process and the high dimensionality of the underlying space induces 
additional attractors, which are termed spurious attractors 5 . In general, these spurious attractors have higher energy 
than the stored patterns, and hence smaller basins of attractions. For the traditional Hopficld model, these 
spurious attractors take the form of odd-hybrids (i.e. hybrids of 3,5,7,... of the £f ) 5 . However, for the projection 
method any hybrid of £f is a spurious attractor of the system 4 . This can be easily understood by noting that in the 
projection method, the entire subspace spanned by the £|* are spurious attractors (see geometric picture above). 

As discussed in the main text, the prediction of spurious attractors in the projection method inspired us to reexamine 
data on existing partially reprogrammed cells. Surprisingly, we found that partially reprogrammed cells could be 
thought of as hybrids of existing cell fates. While in this paper we use discrete states, Hopfield 8 has shown that the 
landscape construction can be generalized to continuous, sigmoidal states. This means that a continuous Hopficld 
model effectively interpolates continuously between a set of discrete states. The agreement between theory and 
experiment provides hints that the fundamental units of cellular identity are discrete states that are read out in a 
continuous manner, suggesting the tantalizing possibility that the discreteness of histone modifications (HM) is an 
important component of the networks that underly cellular identity. 
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D. Predicting candidate TFs for reprogramming 

Our candidate TFs for overexpression should have high projection-contribution ( t] > 0) and high expression 
(zscore > 0.5). Note that conversely, negative projection-contribution ( rj < 0) and low TF expression (zscore < —0.5) 
indicates candidates for knockouts in reprogramming. TF within 0.5 std of the binary threshold are excluded since 
their projection-contribution can change significantly with changes in the threshold (data not shown). Within TFs 
that satisfy the above criteria, we can rank reprogramming candidates by a single number by multiplying projection- 
contribution by z-score TF expression. 

See the SI file TF Reprogramming Candidates for our top 100 candidates for overexpression and our top 100 
candidates for knockout for a variety of cell fates. We only made predictions for cell fates that are homogeneous. 
For example, most of our neural data is based on dissection of the brain, and hence is a heterogenous mixture of 
various neurons and other cell fates. To make predictions for various types of neurons, homogenous microarrays of 
each neuron type are needed. 

In the SI file Processed Data, all data used for the predictions can be found. Another type of prediction can be 
made from the data. For example, we have microarrays for multiple B cell precursors, pre B cells, pre pro B cells, 
and pro B cells. However, if one is interested in reprogramming to any of the B cell precursors, since our predictions 
are based on linear algebra, the ranking for all three precursors can be averaged to produce rankings for a general B 
cell precursor. 



E. Limitations of model 



One limitation of our model is that it misses the importance of the oncogene Myc to reprogramming. Many protocols 
use Myc 9 , but it can be replaced (with no deleterious effect) by short hairpin RNAs (shRNAs) 10 , or dropped completely 
from protocols at the expense of speed and less efficient reprogramming 11 . Thus, Myc appears to increase the rate 
(not outcome) of reprogramming while our current framework has limited information about dynamics. 

Another limitation is the use of Affymetrix GeneChip Mouse Genome 430 2.0 platform. This microarray is useful 
since there exists much public data on a variety of cell fates. However, upon closer examination of one cell fate, lung, 
we discovered the inaccuracy of this microarray. Nkx2-1 is known to be a marker of early lung development 12 , yet 
the Affymetrix GeneChip Mouse Genome 430 2.0 shows no differential expression of Nkx2-1. Therefore, our current 
data may miss key TFs due to a poorly matched probe. For future research, more accurate predictions can be made 
using more modern microarrays such as Affymetrix Mouse Gene 1.0 ST or RNA-Seq data. 
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A. Continuous Cell Type Correlation Matrix 



B. Binarized Cell Type Correlation Matrix 
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FIG. 1: Cell fate correlation matrices. (A) Correlation matrix between cell fates for continuous data. (B) 

Correlation matrix for binarized data. 
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FIG. 2: Projection of a random vector on a given cell fate. Ten thousand binarized random vectors were 
created in MATLAB and projected onto the cellular sub-space. The histogram shows the distribution of the 
projections. The red line is a Gaussian fit to the histogram. The mean is practically zero while the standard 

deviation is 0.0629. 
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FIG. 3: TF Expression is Log-Normal. The figure shows the log-normal z-score with zero corresponding to the 
the previously determined binary cutoff point (5.5 in raw expression value) instead of the log-normal mean (5.35 in 

raw expression value). 
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TABLE I: Classifying ESC Reprogramming Candidates. Justification for transcription factor (TF) classification in 
Figure 3A and 3B. Table has top 100 embryonic stem cell (ESC) reprogramming candidates (as ranked by z-score 

times predictivity, 77^ ) . Classification of each TF is either justified by paper citation or GO Process term. 
Reprogramming TFs are in a pre-existing reprogramming protocol, pluripotency TFs help maintain the ESC state, 
differentiation TFs are expressed in ESC but help induce cell fate change in vivo, and unknown TFs do not yet have 
a known function and hence are intriguing reprogramming candidates. Note that reprogramming TFs Myc, Chdl, 

and Ezh2 not in our top 100 candidates. 



TF 


Z-Scorc 


itf (10- 3 ) 


Z*r?f (10- 3 ) 


Classification 


Citation or GO Term 


Pou5fl 


3.31 


2.32 


7.68 


Reprogramming 


9 


Zfp819 


1.38 


4.96 


6.86 


Unknown 


biological process 


Gbx2 


1.33 


5.01 


6.68 


Differentiation 


midbrain-hindbrain 
boundary development 


Nanog 


2.90 


1.94 


5.62 


Reprogramming 


9 


Prdml4 


0.97 


5.01 


4.84 


Pluripotency 


13 


Oligl 


1.06 


4.43 


4.69 


Differentiation 


neuron fate commitment 


Foxd3 


0.74 


6.10 


4.54 


Pluripotency 


14 


Zic3 


2.32 


1.95 


4.52 


Differentiation 


cell differentiation 


Msc 


0.98 


4.43 


4.33 


Differentiation 


branchiomeric skeletal 
muscle development 


NrObl 


2.44 


1.71 


4.18 


Pluripotency 


negative regulation 
of cell differentiation 


Glil 


1.21 


3.40 


4.11 


Differentiation 


lung development 


Tcfl5 


1.96 


2.07 


4.07 


Unknown 


regulation of transcription, 
DNA-dependent 


Zfp936 


1.99 


1.94 


3.86 


Unknown 


biological process 


Zfpl05 


1.66 


2.30 


3.82 


Unknown 


regulation of transcription, 
DNA-dependent 


Neurodl 


0.82 


4.45 


3.64 


Differentiation 


positive regulation 
of cell differentiation 


Rex2 


1.20 


2.99 


3.58 


Unknown 


biological process 


Pir 


1.72 


2.08 


3.57 


Differentiation 


monocyte differentiation 


Stat4 


0.88 


4.06 


3.56 


Signaling 


signal transduction 


Bcl3 


1.54 


2.30 


3.55 


Differentiation 


marginal zone 
B cell differentiation 


Utfl 


1.26 


2.70 


3.40 


Reprogramming 


9 


Esrrb 


1.71 


1.97 


3.38 


Reprogramming 


9 


Gml3152 


2.50 


1.34 


3.35 


Unknown 


biological process 


Otx2 


1.17 


2.71 


3.17 


Differentiation 


cell fate specification 


Zfp42 


2.77 


1.14 


3.16 


Pluripotency 


15 


Trip6 


1.79 


1.60 


2.87 


Other 


regulation of transcription, 
DNA-dependent; cell adhesion 
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TF 


Z Score 


v (io- 3 ) 


Z*r/ (10~ 3 ) 


Classification 


Citation or GO Term 


Sycp3 


1.35 


2.01 


2.72 


Differentiation 


spermatogenesis 


Zfp568 


0.85 


3.13 


2.66 


Differentiation 


convergent extension involved 
in neural plate elongation 


Trpsl 


0.58 


4.53 


2.62 


Unknown 


negative regulation of 
transcription, DNA-dependent 


Salll 


1.51 


1.70 


2.57 


Differentiation 


neural tube development 


2610305D13Rik 


2.09 


1.22 


2.56 


Unknown 


biological process 


Tfap2c 


1.51 


1.70 


2.56 


Differentiation 


cell differentiation 


Zfp423 


1.25 


2.04 


2.54 


Differentiation 


cell differentiation 


Gli2 


1.78 


1.37 


2.45 


Differentiation 


cell differentiation 


Gml3212 


0.95 


2.56 


2.42 


Unknown 


biological process 


Htatip2 


0.53 


4.55 


2.41 


Differentiation 


cell differentiation 


Arid5b 


0.93 


2.55 


2.36 


Differentiation 


adipose tissue development 


Klf4 


2.47 


0.95 


2.35 


Reprogramming 


9 


Dnmt3b 


1.97 


1.19 


2.33 


Epigenetics 


DNA methylation 


Aire 


1.30 


1.78 


2.32 


Immune 


humoral immune response 


Zfp57 


1.56 


1.46 


2.28 


Epigenetics 


DNA methylation 
involved in embryo development 


Pbrml 


2.00 


1.04 


2.09 


Differentiation 


heart development 


Tgifl 


2.56 


0.81 


2.07 


Differentiation 


positive regulation 
of neuron differentiation 


Soxl5 


1.23 


1.68 


2.06 


Pluripotency 


15 


Zic5 


1.09 


1.89 


2.05 


Differentiation 


cell differentiation 


Zfp473 


1.72 


1.18 


2.03 


Unknown 


regulation of transcription, 
DNA-dependent 


5730507C01Rik 


1.44 


1.37 


1.96 


Unknown 


biological process 


Foxpl 


1.04 


1.87 


1.95 


Pluripotency 


embryo development 


Zfp229 


1.57 


1.22 


1.92 


Unknown 


biological process 


Six4 


1.00 


1.90 


1.91 


Differentiation 


thymus development 


Dmrtl 


1.16 


1.64 


1.90 


Differentiation 


Sertoli cell differentiation 



10 



TF 


Z Score 


v (io- 3 ) 


Z*r? (10~ 3 ) 


Classification 


Citation or GO Term 


ZscanlO 


2.23 


0.84 


1.88 


Pluripotency 


stem cell differentiation 


Hhex 


1.06 


1.68 


1.78 


Differentiation 


B cell differentiation 


Cebpb 


0.78 


2.24 


1.74 


Differentiation 


neuron differentiation 


Eomes 


0.57 


3.01 


1.72 


Differentiation 


cell differentiation 


Foxhl 


1.15 


1.50 


1.72 


Differentiation 


axial mesoderm development 


Zbtb8a 


1.97 


0.87 


1.71 


Unknown 


regulation of transcription, 
DNA-dependent 


Hsf2bp 


1.06 


1.59 


1.69 


Unknown 


biological process 


Sall4 


2.41 


0.64 


1.55 


Reprogramming 


9 


Zfp955b 


1.49 


1.04 


1.55 


Unknown 


biological process 


Rara 


0.94 


1.64 


1.54 


Differentiation 


chondroblast differentiation 


Zfp934 


0.97 


1.59 


1.54 


Unknown 


biological process 


Rest 


1.58 


0.97 


1.54 


Differentiation 


cardiac muscle cell 
myoblast differentiation 


Tcfl5 


1.81 


0.84 


1.52 


Differentiation 


muscle organ development 


Zfp647 


1.42 


1.06 


1.50 


Unknown 


regulation of transcription, 
DNA-dcpendent 


Id3 


1.85 


0.81 


1.50 


Differentiation 


epithelial cell differentiation 


Zfp217 


1.81 


0.81 


1.46 


Unknown 


regulation of transcription, 
DNA-dependent 


Pitx2 


0.90 


1.60 


1.45 


Differentiation 


cardiac muscle cell differentiation 


Egrl 


1.90 


0.74 


1.40 


Differentiation 


T cell differentiation 


Id4 


0.93 


1.51 


1.40 


Differentiation 


cerebral cortex neuron differentiation 


Smarcal 


0.73 


1.91 


1.39 


Differentiation 


neuron differentiation 


Zfp799 


0.84 


1.66 


1.39 


Unknown 


biological process 


Tead2 


1.94 


0.72 


1.39 


Differentiation 


notochord development 


Usfl 


0.80 


1.72 


1.38 


DNA Repair 


response to UV 


Zfp760 


1.25 


1.10 


1.37 


Unknown 


biological process 


Plagll 


0.78 


1.75 


1.36 


Unknown 


regulation of gene expression 



11 



TF 


Z Score 


n (io- 3 ) 


Z*?? (10~ 3 ) 


Classification 


Citation or GO Term 


Sox2 


3.01 


0.44 


1.33 


Reprogramming 


9 


Hmgb2 


2.40 


0.54 


1.30 


Differentiation 


positive regulation 
of erythrocyte differentiation 


Rbpms 


1.79 


0.72 


1.29 


Unknown 


regulation of transcription, 
DNA-dcpendent 


Rhox6 


0.94 


1.38 


1.29 


Unknown 


biological process 


Grhll 


0.75 


1.71 


1.28 


Unknown 


regulation of transcription, 
DNA-dependent 


Prdm5 


0.83 


1.51 


1.25 


Epigenetic 


chromatin modification 


Gm5595 


1.05 


1.17 


1.23 


Unknown 


biological process 


Relb 


0.83 


1.48 


1.23 


Differentiation 


T-helper 1 cell differentiation 


Zbtbl2 


1.02 


1.20 


1.22 


Unknown 


biological process 


Sp5 


0.60 


2.03 


1.22 


Differentiation 


bone morphogenesis 


Lin28a 


1.99 


0.60 


1.19 


Reprogramming 


9 


Zfp553 


1.18 


0.99 


1.17 


Unknown 


regulation of transcription, 
DNA-dependent 


E2fl 


0.88 


1.32 


1.17 


Differentiation 


forebrain development 


Peg3 


0.76 


1.49 


1.14 


Apoptosis 


apoptotic process 


Zfp449 


0.62 


1.82 


1.13 


Unknown 


regulation of transcription, 
DNA-dependent 


Nr5a2 


0.73 


1.49 


1.09 


Reprogramming 


9 


Grhl3 


0.70 


1.56 


1.08 


Differentiation 


ectoderm development 


Nfya 


0.68 


1.59 


1.08 


Unknown 


positive regulation of transcription 
DNA-dcpendent 


Soxll 


0.78 


1.37 


1.07 


Differentiation 


embryonic skeletal 
system morphogenesis 


Cenpi 


1.52 


0.71 


1.07 


Unknown 


biological process 


Smadl 


1.29 


0.83 


1.07 


Pluripotency 


13 


Etv4 


0.82 


1.30 


1.06 


Differentiation 


stem cell differentiation 


Cenpn 


2.18 


0.48 


1.04 


Unknown 


biological process 


Zfp296 


1.98 


0.52 


1.03 


Unknown 


biological process 


Maff 


0.92 


1.12 


1.03 


Differentiation 


regulation of epidermal 
cell differentiation 



TABLE II: Comparison of rankings. Here we compare our ranking based on mulitplying z-score by 
projection-contribution versus the z-score only ranking. Note that in liver, Gata4 is within 0.5 std of the binary 

cutoff and therefore its projection-contribution is unreliable. 



TF 


Protocol 


Our Rank 


Z-score Rank 


Gata4 


Heart 


6 


68 


Tbx5 


Heart 


13 


129 


Mef2c 


Heart 


50 


38 


Foxa3 


Liver 


2 


13 


Foxa2 


Liver 


9 


36 


Hnf4a 


Liver 


10 


6 


Foxal 


Liver 


19 


48 


Gata4 


Liver 


No rank 


415 



